1. INTRODUCTION

Human behavior and negligence have caused significant changes in the global ecosystem. These
changes have resulted in the emergence of pandemic threats caused by existing
infectious diseases, virus mutations, or new diseases. Pathogens have evolved rapidly and
threatened the human population through various infectious diseases. To tackle these threats, scientists and
policymakers have taken action to prevent the spread of disease by reducing the risks [1].
Moreover, unhealthy lifestyles have affected the condition of the human body in ways that may lead to
internal and genetic diseases. The World Health Organization (WHO) reported that heart disease, pulmonary
disease, and stroke are the top three leading causes of death globally [2]. In addition, the emergence of
COVID-19 has worsened human health conditions and changed health data globally. The COVID-19 pandemic
has affected many aspects of life and still poses many challenges due to virus mutation [3]. This
phenomenon has accelerated research in epidemiology, which studies the patterns of disease spread or
health-related occurrences and the factors that influence them. During the pandemic, this knowledge has
been beneficial in mapping the spread pattern of COVID-19.

Over the years, the study of epidemiology has evolved, and a new paradigm has developed both in
public health services and scientific research. Data analysis methods have improved and now require a
multidisciplinary approach to overcome various disease outbreaks. Increasingly advanced technology has been
developed for epidemiology at all levels of analysis, from populations, individuals, organs, and cells to DNA,
to elaborate on many concepts of health and disease [4]. The development of information and
communication technology (ICT) has sped up data analytics processes and increased the volume of datasets
drawn from disparate data sources in the healthcare sector. Artificial intelligence (AI) has enabled computers
to imitate human intelligence and process huge amounts of data beyond human capability. Advances in AI
have produced a wide range of machine learning approaches that automate analysis, support real-time
decision-making, and discover solutions to complex healthcare problems by learning patterns through
algorithms and statistical models. The artificial intelligence
community developed machine learning by utilizing statistical methods to solve many research problems. AI
and machine learning have revolutionized the transition from traditional to modern epidemiology by
delivering solutions for the analysis of complex clinical data in many applications, including
disease diagnosis, drug repurposing and discovery, personalized health treatment, health risk identification,
outbreak prediction, and intelligent health system development [5], [6].

This review study explores the evolution of machine learning
applications in support of the scientific discipline of epidemiology. Bibliometric analysis was first conducted
to gain insight into the scientific mapping of the research domain. Systematic content analysis was then
performed to deepen the analysis of the contributions of machine learning in tackling disease outbreaks.
This work makes the following contributions: 1) it presents an overview of the role of machine
learning in handling disease outbreaks; 2) it describes the development and state of the art of this research
domain through network and keyword evolution analysis; and 3) it presents insight into the infectious
diseases that receive the most attention from researchers, together with a summarized view of the advanced
machine learning techniques used for solving disease outbreak problems, through content analysis.

This article is organized as follows. Section 1 describes the context and the relevance of the
research. Section 2 presents previous works related to the topic of this article,
including the research gap that this study aims to fill. Section 3 describes the study workflow and
materials used in this review. The results of the study are discussed in section 4. The beginning of section
4 explains some bibliometric measurements in two subsections: scientific production, and research
interest and evolution. The last part of section 4 describes 12 topic hotspots selected through the topic
modeling process. These topics are categorized into three groups, namely COVID-19 disease, miscellaneous
diseases, and public opinion on disease outbreaks. Finally, the limitations and conclusion of this work are
presented in section 5 and section 6.


2. RELATED WORKS 

The growing use of machine learning in the health sector has encouraged the development of health 
science in dealing with disease outbreaks. Advances in AI and machine learning (ML) have helped society 
solve global health challenges and accelerated the achievement of sustainable healthcare. Research on this 
technology has been done for the health sector with various objectives, including disease diagnosis, health 
risk assessment, prediction and surveillance of disease outbreaks, and health management strategies [7]-[9]. 
The emergence of AI-based digital epidemiology supported by machine learning has dramatically contributed 
to the improvement of public health and disease outbreak handling [10], [11]. Many previous studies
have examined the application of machine learning in the health sector. The emergence of a
pandemic outbreak has accelerated the development of machine learning for epidemiology, the study of
disease transmission and the factors that influence the disease. The following paragraphs discuss
the previous works related to this review study.

Research on the spread of infectious diseases has increased since the COVID-19 pandemic struck
globally. Alfred and Obit [12] discussed the role of machine learning in handling disease outbreaks
and reducing their spread. The study focused on detecting and predicting disease attacks by applying
various classification and prediction models to structured and unstructured data. A systematic review of
surveillance systems using social media data for similar purposes was conducted in [13]. Moreover, the
development of text mining and the increasing use of social media offer research opportunities to
implement machine learning for health computing, covering the identification of risk factors and symptoms,
health crisis phases, and public health responses to pandemics [14]-[16].

Some disease outbreaks have been examined in specific areas of epidemiological research. Sak and Suchodolska
[17] reviewed the potential application of machine learning in nutritional epidemiology to identify the
influence of nutrients on human health and disease. Basu et al. [18] discussed the application
of machine learning in diabetes clinical epidemiology to improve risk stratification and prediction. Review
studies have also been conducted for particular diseases and tasks, such as cancer risk assessment [19],
mosquito-borne disease transmission [20], COVID-19 diagnosis [21], and others.

From the analysis of previous studies, we observe that existing research discusses the use of ML
for specific diseases or for particular purposes such as disease detection, risk prediction, and spread
estimation. Few studies discuss the application of ML in dealing with disease outbreaks
from a broad, high-level perspective. To fill this gap, our research takes a broader perspective than
previous studies. We explore the following issues: the evolution of research topics, research advances in
infectious diseases, and the machine learning models currently in use.


3. MATERIALS AND METHOD 

This review study used records from the Scopus citation database as bibliographic data for analysis.
Scopus was chosen because it covers a broader spectrum and a larger number of peer-reviewed
publications and provides reliable bibliographic data [22]. The study was carried out in three stages: study
design, data preparation, and data analysis. The study workflow is presented in Figure 1. First, the study
design stage determined the objectives of the study, the citation database, and the software tools. The tools
were employed to devise the scientific mapping of the research domain. Then, we fixed the search keywords and
inclusion/exclusion criteria to select the desired bibliometric data.
Figure 1. Study workflow


The bibliographic data were collected by employing two sets of terms related to the study domain.
The first set contains the search keywords ‘machine learning’ and ‘deep learning’. The keyword ‘deep learning’
was included because it is an advanced form of machine learning that is currently growing due to
increasing data variety and volume. The second set contains the search keywords ‘pandemic’, ‘epidemic’,
‘endemic’, and ‘disease outbreak’ to enrich the epidemiology term.
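As a rough illustration of how the two keyword sets can be combined into one search string, the sketch below joins the terms of each set with OR and links the two sets with AND. The helper name `build_query`, the AND combination, and the `TITLE-ABS-KEY` field code are assumptions for illustration; the exact query and the searched fields are not stated here.

```python
# Hypothetical helper: builds a boolean search string from two keyword sets.
# Assumptions: the sets are combined with AND, and the TITLE-ABS-KEY field
# code is used; the source does not state which Scopus fields were searched.

def build_query(method_terms, domain_terms, field="TITLE-ABS-KEY"):
    """Join each set with OR, then require one term from each set with AND."""
    def or_group(terms):
        return " OR ".join(f'"{term}"' for term in terms)

    return f"{field}(({or_group(method_terms)}) AND ({or_group(domain_terms)}))"


# Keyword sets as listed in the text
method_terms = ["machine learning", "deep learning"]
domain_terms = ["pandemic", "epidemic", "endemic", "disease outbreak"]

query = build_query(method_terms, domain_terms)
print(query)
```

Keeping the two sets as separate lists makes it easy to add or drop a term and regenerate the query consistently.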

In the early data preparation stage, the bibliometric data were collected from the Scopus citation
database using the keywords determined in the previous stage. The data collection process
resulted in 4,062 records of published articles. We applied inclusion criteria for data selection, keeping only
articles published in 2000-2021 and written in English. Only publications in the form of journal papers,
books, book chapters, conference papers, and reviews were included for analysis. Duplicated and incomplete
records were removed from the dataset. We then selected articles concerned with human diseases
and excluded articles related to animal and plant diseases. The details of the data selection process are
presented in Figure 2. At the end of the data preparation process, 3,447 articles were selected for the next stage.
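The selection criteria above can be sketched as a simple record filter. This is a minimal sketch on invented toy records: the field names, the fields treated as required for a "complete" record, the title-based duplicate check, and the `human_disease` flag are all assumptions, since the actual screening procedure is only summarized in the text.

```python
# Illustrative filter for the inclusion/exclusion criteria described above.
# All field names and the toy records are hypothetical.

ALLOWED_TYPES = {"journal paper", "book", "book chapter", "conference paper", "review"}
REQUIRED_FIELDS = ("title", "year", "language", "doc_type", "abstract")


def select_records(records):
    """Apply inclusion criteria, then drop incomplete and duplicate records."""
    seen_titles = set()
    selected = []
    for rec in records:
        if any(not rec.get(field) for field in REQUIRED_FIELDS):
            continue  # incomplete record
        if not 2000 <= rec["year"] <= 2021:
            continue  # outside the publication window
        if rec["language"] != "English":
            continue
        if rec["doc_type"] not in ALLOWED_TYPES:
            continue
        if not rec.get("human_disease", False):
            continue  # animal and plant disease articles are excluded
        key = rec["title"].strip().lower()
        if key in seen_titles:
            continue  # duplicate record
        seen_titles.add(key)
        selected.append(rec)
    return selected


toy_records = [
    {"title": "ML for dengue forecasting", "year": 2018, "language": "English",
     "doc_type": "journal paper", "abstract": "...", "human_disease": True},
    {"title": "ml for dengue forecasting ", "year": 2018, "language": "English",
     "doc_type": "journal paper", "abstract": "...", "human_disease": True},
    {"title": "Deep learning for crop blight", "year": 2020, "language": "English",
     "doc_type": "conference paper", "abstract": "...", "human_disease": False},
    {"title": "Epidemic models", "year": 1998, "language": "English",
     "doc_type": "review", "abstract": "...", "human_disease": True},
    {"title": "COVID-19 detection", "year": 2021, "language": "English",
     "doc_type": "editorial", "abstract": "...", "human_disease": True},
]

selected = select_records(toy_records)
print(len(selected), "of", len(toy_records), "toy records kept")
```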

The bibliometric analysis utilized the Bibliometrix software package and the Biblioshiny tool running
under the R environment. The package provides tools for quantitative analysis to examine and visualize
bibliometric data. In the last stage, we conducted descriptive and network analysis to develop knowledge
maps and the conceptual and intellectual structures of the research field [23]. Then, a machine learning approach
called latent Dirichlet allocation (LDA) was applied for content analysis to explore the most prominent
topics and cluster them into several issues. LDA works on the distributions of words, and it was applied to the
abstracts extracted from all articles. Using LDA, similar topics were examined across
articles and clustered by the words that appeared salient in the dataset. We selected clusters that can shape issues
related to the research domain. At the end of the stage, a complete reading of the articles was conducted to
determine the proper topic hotspot for each cluster.
Figure 2. Selection process


4. RESULTS AND DISCUSSION 
4.1. Scientific production 

The dataset consists of 3,447 publications, more than ninety percent of which are journal
articles and proceedings papers, with few publications in books or book chapters. Figure 3 plots the evolution
of publications in this research area during the last two decades. In this period,
several infectious disease outbreaks occurred, including human
immunodeficiency virus (HIV), hemagglutinin 1 and neuraminidase 1 (H1N1) influenza, severe acute
respiratory syndrome (SARS), Middle East respiratory syndrome (MERS), dengue, chikungunya, and Zika
[1], in addition to COVID-19. The figure shows that the volume of publications grew steadily
between 2000 and 2012 and increased slightly from 2013 onward. That increase was driven by the emergence
of outbreaks of MERS, Ebola, and Zika. After the novel coronavirus was first identified at the end of 2019,
the trend showed exponential growth. The COVID-19 pandemic has accelerated academic research in this field,
yielding almost two-thirds of the articles in the dataset.


Figure 3. Annual production of publications (plot of annual scientific production by year, 1999-2021)


Scopus categorizes articles into several subject areas. An article can be classified in more than one
subject area, in which case it is counted once for each area. The top 15 subject areas with the most articles
in the dataset are shown in Table 1, arranged in descending order. The three major subject areas are computer
science, medicine, and engineering, which together account for more than 50% of the total. The rest of the
articles are spread over various subject areas. Machine learning was developed by the community engaged in
computer science and engineering. Hence, as shown in the table, these two subject areas have contributed more
than one-third to this study domain. IEEE Access published the most articles related to this subject area.
Table 1. Top research areas according to the number of articles

Subject area                                   Number of articles   Percentage (%)
Computer Science                               1,672                21.98
Medicine                                       1,380                18.14
Engineering                                    892                  11.73
Biochemistry, Genetics & Molecular Biology     495                  6.50
Mathematics                                    494                  6.49
Decision Sciences                              322                  4.23
Environmental Science                          248                  3.26
Physics and Astronomy                          246                  3.23
Social Sciences                                222                  2.91
Agricultural and Biological Sciences           220                  2.89
Materials Science                              195                  2.56
Multidisciplinary                              192                  2.52
Immunology and Microbiology                    134                  1.76
Health Professions                             133                  1.74
Energy                                         107                  1.40
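
The shares quoted in the text can be checked directly against the rows of Table 1. The short sketch below sums the percentages for the top three subject areas and for computer science plus engineering, and also backs out the denominator implied by each row. The interpretation that the percentages are taken over subject-area assignments rather than over the 3,447 articles is an inference from the numbers, not something stated in the text.

```python
# Rows copied from Table 1: subject area -> (number of articles, percentage)
table1 = {
    "Computer Science": (1672, 21.98),
    "Medicine": (1380, 18.14),
    "Engineering": (892, 11.73),
    "Biochemistry, Genetics & Molecular Biology": (495, 6.50),
    "Mathematics": (494, 6.49),
    "Decision Sciences": (322, 4.23),
    "Environmental Science": (248, 3.26),
    "Physics and Astronomy": (246, 3.23),
    "Social Sciences": (222, 2.91),
    "Agricultural and Biological Sciences": (220, 2.89),
    "Materials Science": (195, 2.56),
    "Multidisciplinary": (192, 2.52),
    "Immunology and Microbiology": (134, 1.76),
    "Health Professions": (133, 1.74),
    "Energy": (107, 1.40),
}
ARTICLES_IN_DATASET = 3447  # dataset size stated in the text

# Combined share of the three largest subject areas
top_three_share = round(
    sum(table1[area][1] for area in ("Computer Science", "Medicine", "Engineering")), 2
)

# Combined share of computer science and engineering
cs_eng_share = round(table1["Computer Science"][1] + table1["Engineering"][1], 2)

# Denominator implied by each row (count divided by its percentage)
implied_totals = [count / (pct / 100) for count, pct in table1.values()]

print("Top three share (%):", top_three_share)
print("Computer science + engineering share (%):", cs_eng_share)
print("Implied denominator range:", round(min(implied_totals)), "to", round(max(implied_totals)))
```

Every row implies a denominator of roughly 7,600, more than twice the number of articles, which is consistent with an article being counted once for each subject area it belongs to.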


4.2. Research interest and evolution 

The analysis of research interest evolution begins with the occurrence of the keywords, title words, and
abstract words that the authors used most often. The results show that of the 6,855 author keywords
in the dataset, two appear more than a thousand times, namely "machine learning" and
"COVID-19". Given the topic of epidemiology and the world situation in the last two years, several
authors frequently use the terms "SARS-COV2" and "coronavirus", which refer to COVID-19, as
keywords. Frequently occurring authors’ keywords are shown in Table 2. Figure 4 shows the word
cloud of the most used author keywords, with frequency indicated by text size. We can see that convolutional
neural network (CNN), long short-term memory (LSTM), random forest, and support vector machine (SVM) are the
machine learning algorithms most widely used by the authors. Data mining tasks, including classification,
sentiment analysis, and forecasting, also appear prominently in the word cloud. Moreover, natural
language processing (NLP), the branch of AI developed to process text or voice data, is also widely used in
this study domain.


Table 2. Most frequently used words and keywords in the literature

Author keywords            n        Words in titles    n
machine learning           1,034    covid              1,401
COVID-19                   1,028    learning           1,280
deep learning              571      machine            774
artificial intelligence    257      based              557
coronavirus                181      deep               505
epidemiology               156      data               410
sars-cov2                  142      prediction         351
pandemic                   121      detection          321
classification             98       analysis           314
prediction                 96       pandemic           271
Figure 4. Word cloud


Research themes can be identified by analyzing a co-word network, a network analysis that aims to
develop the conceptual structure of a particular scientific domain by mapping and clustering terms extracted
from a collection of text data [23]. Figure 5 shows the co-word network, which uses the authors' keywords as the
unit of analysis to build the conceptual structure of the dataset. The 6,855 keywords identified
would be too many to fit on a chart. Therefore, the chart was limited to 50 nodes with a three-occurrence
threshold to keep it readable. The thickness of a connecting line indicates a proximity measure of the
association between two keywords. The node size indicates the weight of the occurrence of the word.
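
The construction of such a network can be sketched in a few lines. The example below counts in how many articles each keyword occurs (node weight) and how often two retained keywords occur together (edge weight), keeping at most 50 nodes with at least three occurrences. It is an illustration on invented toy data, not the Bibliometrix implementation: the function name is hypothetical, and the raw co-occurrence count stands in for whatever normalized proximity measure the tool applies.

```python
from collections import Counter
from itertools import combinations


def coword_network(keyword_lists, max_nodes=50, min_occurrence=3):
    """Return (node weights, edge weights) for a keyword co-occurrence network."""
    # Node weight: number of articles in which the keyword occurs
    occurrence = Counter(kw for kws in keyword_lists for kw in set(kws))
    frequent = [kw for kw, n in occurrence.most_common() if n >= min_occurrence]
    nodes = set(frequent[:max_nodes])

    # Edge weight: number of articles in which both keywords occur
    edges = Counter()
    for kws in keyword_lists:
        kept = sorted(set(kws) & nodes)
        for a, b in combinations(kept, 2):
            edges[(a, b)] += 1
    return {kw: occurrence[kw] for kw in nodes}, edges


# Toy author-keyword lists, one list per article (invented for illustration)
articles = [
    ["machine learning", "covid-19", "prediction"],
    ["machine learning", "covid-19", "deep learning"],
    ["machine learning", "covid-19", "random forest"],
    ["deep learning", "covid-19", "x-ray"],
    ["deep learning", "machine learning"],
]

nodes, edges = coword_network(articles, max_nodes=50, min_occurrence=3)
print(nodes)
print(dict(edges))
```

Keywords below the occurrence threshold are dropped before edges are counted, which is what keeps the chart readable when thousands of distinct keywords exist.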

The keywords are grouped into four interrelated clusters. The green "COVID-19" and the red 
"machine learning" have the largest node sizes, indicating that they are the most frequently occurring 
keywords. They also have the highest proximity, which indicates how often these keywords occur together. 
The red cluster consists of 15 words. The majority are machine learning techniques and methods, and the
others are related to infectious diseases and outbreaks, such as "dengue", "COVID-19 pandemic", and
"epidemiology". The closest connections to the "machine learning" node are the keywords "prediction",
"classification", and "random forest".
Figure 5. Co-word network


The green cluster corresponds to COVID-19 disease and can be divided into three sub-clusters. The
first sub-cluster concerns data processing, including "data mining", "sentiment analysis", "natural language
processing", "big data", and "text mining". In the second sub-cluster, the keywords "social media" and
"Twitter" can be interpreted as the most widely used data sources. The last sub-cluster is related to
pandemics, with keywords such as "infodemiology", "surveillance", "epidemiology", "mental health", and
"public health". Furthermore, we can see that "big data" has a strong relationship with "machine learning"
and "COVID-19" in this domain.

The blue cluster has the highest number of keywords, consisting of 16 terms. The keywords are
mostly related to the diagnosis of COVID-19 using deep learning, as indicated by the words "coronavirus",
"computer vision", "image processing", "computed tomography", "pneumonia", "x-ray", and others. The keywords
"convolutional neural networks" and "transfer learning" are associated most strongly with "deep
learning", in addition to "coronavirus". Finally, the purple cluster is the smallest cluster on the co-word chart.
In this cluster, "artificial intelligence" is the dominant keyword and is most closely associated with
"pandemic". The keywords "health care", "social distancing", and "internet of things" are also included in
this cluster.

Co-word network analysis showed that machine learning has contributed to research on handling
infectious diseases in the last two decades. Prediction, forecasting, classification, and
diagnosis are the research topics most frequently addressed in previous research. Classical algorithms such as
the support vector machine and random forest were still of interest and used to solve many medical problems.
The high ability of deep learning to provide transferable solutions has driven the research in this domain to
cover more complex problems. The convolutional neural network algorithm and its advances have been widely used
for analyzing medical images. In addition to medical records and image analysis, machine learning was also
used in natural language processing and text mining to analyze opinions posted on social media related
to public health and mental health, which were prominent when the world was facing the pandemic.

Bulletin of Electr Eng & Inf, Vol. 11, No. 4, August 2022: 2169-2186
Bulletin of Electr Eng & Inf ISSN: 2302-9285

Figure 6 demonstrates the evolution of keywords and research directions over several periods. We
divided the observed period into the time spans 2000-2009, 2010-2019, 2020, and 2021. The last two periods
each span only one year because, during these two years, the number of articles discussing
machine learning for epidemiology grew exponentially due to the COVID-19 pandemic. In the early
period of 2000-2009, Bayesian networks and neural networks were widely used in the research. Epistasis,
the interaction between different genes, became a topic of interest in that period and was applied in
molecular and quantitative genetics research to explore complex diseases [24]. In this period, researchers
were more concerned with epistasis and genetic problems.
Figure 6. Evolution of keywords
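
The time slicing behind a keyword evolution analysis can be sketched as follows: each article is assigned to one of the four periods by its publication year, and the most frequent keywords are then listed per period. The function names and the toy records are invented for illustration; the keywords echo ones mentioned in the text, but the counts carry no empirical meaning.

```python
from collections import Counter

# Time slices used for the keyword evolution analysis
PERIODS = [
    ("2000-2009", 2000, 2009),
    ("2010-2019", 2010, 2019),
    ("2020", 2020, 2020),
    ("2021", 2021, 2021),
]


def period_of(year):
    """Return the label of the period containing the year, or None."""
    for label, start, end in PERIODS:
        if start <= year <= end:
            return label
    return None


def keywords_by_period(records, top_n=3):
    """Count keywords per period and return the top_n keywords of each."""
    counts = {label: Counter() for label, _, _ in PERIODS}
    for year, keywords in records:
        label = period_of(year)
        if label is not None:
            counts[label].update(set(keywords))
    return {label: [kw for kw, _ in c.most_common(top_n)] for label, c in counts.items()}


# Toy (year, author keywords) records, invented for illustration
records = [
    (2005, ["bayesian network", "epistasis"]),
    (2007, ["bayesian network"]),
    (2015, ["random forest", "dengue"]),
    (2018, ["random forest"]),
    (2020, ["covid-19"]),
    (2021, ["covid-19", "federated learning"]),
    (2021, ["covid-19"]),
]

evolution = keywords_by_period(records)
for label, top in evolution.items():
    print(label, top)
```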


During the period 2010-2019, logistic regression and random forest were widely applied in the
research in addition to the artificial neural network. The analysis shows that many studies applied the data
mining techniques of classification, clustering, and anomaly detection to health and bio-sciences. Deep
learning and reinforcement learning also enabled studies to process multimodal data from large and complex
datasets. Breast cancer and dengue were two diseases that received more attention in machine learning research.
Blockchain technology, which was growing toward the end of this period, was applied to develop various
intelligent healthcare systems. Additionally, social media data was becoming a prominent source for disease
surveillance.

When COVID-19 was declared a global pandemic in 2020, research on machine learning for
this deadly virus increased dramatically and has continued to grow to date. Traditional machine learning and
deep learning were still widely employed in COVID-19 research. The existence of immense data, decentralized
across many distributed locations or remote devices, has encouraged researchers to find proper methods to
address many pandemic issues by harnessing the data. Federated learning is a machine learning method in which
statistical models are trained over remote devices or siloed data centers while keeping data localized. This
method is a distributed machine learning approach that enables models to be trained on a large corpus of
decentralized data.
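
To make the idea concrete, the sketch below implements federated averaging, one common federated learning algorithm, for a one-parameter linear model. It is a toy illustration under stated assumptions and not the method of any reviewed study: each client runs a few gradient steps on its own private data, and the server only receives and averages the resulting weights, weighted by the number of local samples. All function names, hyperparameters, and the synthetic data are hypothetical.

```python
import random


def local_update(w, data, lr=0.05, epochs=5):
    """Run a few gradient-descent steps on one client's private (x, y) pairs."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_average(client_datasets, rounds=30, w_init=0.0):
    """Server loop: broadcast the weight, collect local weights, average them."""
    w_global = w_init
    total = sum(len(data) for data in client_datasets)
    for _ in range(rounds):
        # Raw data never leaves a client; only the updated weight is shared
        local_weights = [local_update(w_global, data) for data in client_datasets]
        w_global = sum(
            w * len(data) for w, data in zip(local_weights, client_datasets)
        ) / total
    return w_global


def make_client(n_samples, slope, rng):
    """Synthetic private dataset for one client: y = slope * x + noise."""
    xs = [rng.uniform(0, 2) for _ in range(n_samples)]
    return [(x, slope * x + rng.gauss(0, 0.1)) for x in xs]


rng = random.Random(0)
clients = [make_client(n, 3.0, rng) for n in (20, 50, 30)]  # three toy "hospitals"

w_global = federated_average(clients)
print("Estimated slope:", round(w_global, 3))
```

Weighting the average by local sample count lets clients with more data contribute proportionally more to the shared model.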

The analysis of the keyword evolution showed that the application of bioinformatics using machine
learning became increasingly advanced after 2010. In the early years of this period, research on cancer and
dengue was prominent. The application of machine learning to cancer was concerned with patient
survival, whereas dengue research focused on landscape epidemiology and the effects of climate change on the
disease. Since the COVID-19 outbreak occurred, machine learning algorithms have been increasingly applied
to deal with disease treatments and reduce the risk of comorbidities caused by hypertension, diabetes, cancer,
and others. Because the symptoms of COVID-19 are similar to those of influenza, machine learning
research on influenza has been continuously conducted to devise patient treatments and to discover drugs and
vaccines for COVID-19. During this pandemic, research has also paid attention to the early warning of
disease spread and health policy to prevent the wider spread of this outbreak.


4.3. Topic hotspots 

The last stage of the study is content analysis, performed by applying LDA to the abstracts collected from
the dataset. This machine learning-based text mining technique was developed to uncover hidden thematic
structures, and it is used here to find the topic hotspots of the domain. This unsupervised probabilistic
method employs a hierarchical Bayesian model that provides a set of topics and the probabilities that
represent the topics’ strengths across the dataset [25]. The content analysis is intended to shed light on the
main research areas that receive the most attention, so that the research landscape can be exposed and the
future direction of the research can be drawn.
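
For reference, the hierarchical Bayesian model behind LDA is usually written as the generative process below. The notation is the standard textbook one and is an assumption here, since no symbols are defined in the text: D documents, K topics, per-document topic proportions \theta_d, per-topic word distributions \phi_k, and Dirichlet hyperparameters \alpha and \beta.

```latex
\begin{align}
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D \\
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1,\dots,K \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d), & n &= 1,\dots,N_d \\
w_{d,n} \mid z_{d,n}, \phi &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{align}
```

Here w_{d,n} is the n-th word of abstract d and z_{d,n} is its latent topic. Fitting the model estimates \theta_d, the strength of each topic in each abstract, and \phi_k, whose highest-probability words characterize topic k.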

Initially, the analysis was designed to generate 20 topics from the abstracts of the selected documents.
After cleaning and stemming the dataset, the LDA model yielded topics that had to be reshaped to obtain
prominent themes. We identified 12 topics regarding the domain of the study, as listed in Table 3. Some
closely related topics were merged due to their similar themes, and some others were removed because they
were too weak to build a particular theme. The extracted topics were then grouped into three clusters:
1) COVID-19 disease; 2) miscellaneous diseases; and 3) public opinion on disease outbreaks. Content analysis was
conducted to select the eligible documents that were strongly related to the identified topics. The following
subsections discuss these clusters.
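
As a minimal sketch of how LDA extracts topics from tokenized abstracts, the example below fits the model with collapsed Gibbs sampling, one of several possible inference methods, and lists the most frequent words per topic. It is an illustration under assumptions: the tool and inference method actually used are not stated in the text, the toy documents are invented, the hyperparameter values are arbitrary, and two topics are used instead of the 20 mentioned above to keep the example small.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling on tokenized documents.

    Returns (top words per topic, document-topic count matrix).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # topic counts per document
    topic_word = [[0] * n_words for _ in range(n_topics)]  # word counts per topic
    topic_total = [0] * n_topics                           # total words per topic
    assignments = []

    # Random initial topic assignment for every token
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                v = word_id[w]
                z = assignments[d][n]
                # Remove the token's current assignment from the counts
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Conditional probability of each topic for this token
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][n] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1

    top_words = [
        [vocab[v] for v in sorted(range(n_words), key=lambda v: -topic_word[k][v])[:3]]
        for k in range(n_topics)
    ]
    return top_words, doc_topic


# Toy pre-cleaned, stemmed abstracts (invented for illustration)
docs = [
    ["covid", "xray", "image", "chest", "covid", "image"],
    ["image", "xray", "pneumonia", "chest", "covid"],
    ["tweet", "sentiment", "twitter", "public", "tweet"],
    ["sentiment", "public", "twitter", "tweet", "opinion"],
]

top_words, doc_topic = lda_gibbs(docs, n_topics=2, n_iter=50)
for k, words in enumerate(top_words):
    print("Topic", k, words)
```

In practice the raw topics still need the manual reshaping described above: merging near-duplicates, discarding weak topics, and naming each theme after reading the documents.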


Table 3. Related topics covered by the selected articles

Topic                                      Top 20 prominent words

Cluster 1: COVID-19 disease
1  Case trend and spread prediction        covid, patient, prediction, model, mortality, risk, hospital, severe, spread,
                                           disease, infectious, confirm, clinic, death, identification, trend, forecast,
                                           outbreak, rate, and increase
2  Disease detection and health            covid, health, care, monitor, digital, service, device, IoT, solution,
   monitoring                              intelligence, sensor, smart, mobile, collect, medical, community, remote, real,
                                           time, and internet
3  Medical imaging diagnosis               covid, image, model, detection, x-ray, cxr, accuration, disease, diagnosis,
                                           chest, patient, classification, pneumonia, case, infection, severe, medical,
                                           scan, screening, and spread
4  Medicine, vaccine and immunity          covid, disease, drug, development, intelligence, prediction, medical health,
                                           new, model, spread, analysis, vaccine, effect, advance, world, virus, treatment,
                                           emergency, and clinic
5  Viral infection                         covid, antibody, patient, resistance, sequence, viral, response, severe,
                                           biomark, protein, pathogen, bacteria, immune, genome, antigen, detection,
                                           phenotype, strain, antibiotic, and infection
6  Health protocol and policies            covid, image, mask, detection, classification, human, face, infection, risk,
                                           distance, social, control, model, identification, time, people, lockdown,
                                           policy, government, and spread

Cluster 2: Miscellaneous disease
7  Internal disease and risks              diabetic, risk, disease, liver, heart, cardiovascular, biomark, ecg, retina,
                                           age, organ, transplantation, factor, kidney, obesity, nutrition, eye, tissue,
                                           metabolism, and diet
8  Cancer disease                          cancer, disease, patient, clinic, prediction, survival, classification, risk,
                                           age, treatment, symptom, severe, observation, breast, death, care, brain,
                                           mortality, diagnosis, and epidemiology
9  Influenza disease                       factor, classification, prediction, influenza, health, flu, host, risk, human,
                                           anxiety, disease, country, pandemic, season, transmission, outbreak, area,
                                           virus, public, and environment
10 Mosquito-borne disease                  dengue, malaria, prediction, epidemic, blood, patient, analysis, climate,
                                           factor, spread, detection, fever, diagnostic, effect, classification,
                                           infection, incident, region, mosquito, and identification
11 HIV and drug abuse                      risk, HIV, association, estimation, identification, smoke, population, factor,
                                           effect, health, age, activity, birth, prediction, suicide, social, drug
                                           overdose, behavior, and opioid

Cluster 3: Public opinion on disease outbreaks
12 Public opinion on disease outbreaks     health, social, tweet, media, public, covid, twitter, topic, identification,
                                           opioid, sentiment, pandemic, emotion, disease, analysis, online, surveillance,
                                           news, concern, and mental


4.3.1. COVID-19 disease 
The COVID-19 outbreak has opened up many research opportunities in multidisciplinary
studies. Approximately 45% of the total articles in the dataset are COVID-19 articles published in 2020-2021.
Six topics related to the COVID-19 disease cluster were obtained from the results of LDA topic modelling,
namely: 1) case trend and spread prediction; 2) disease detection and health monitoring; 3) medical imaging
diagnosis; 4) medicine, vaccine and immunity; 5) viral infection; and 6) health protocol and policies.

Studies on the spread of the novel coronavirus in the early phase of the pandemic mainly focused on
predicting new cases, recovery, and mortality, at a global [26], [27] or country-specific scope [28]. Severity
analysis of contaminated areas was also a concern for researchers [29]. Risk assessment and mapping for
disaster management during the pandemic were analyzed to prevent wider transmission and reduce the
number of cases [30]. Regression and time series models dominated the previous studies,
complemented by statistical models and the susceptible, infected, and recovered (SIR) mathematical model [31]
for dynamic or real-time prediction and forecasting purposes. Overall, the survey identified 344 articles in
topic 1 discussing the coronavirus spread from the dataset. The journal Chaos, Solitons and Fractals was the
most prolific journal on this topic.
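
For readers unfamiliar with the SIR model mentioned above, the sketch below integrates its standard equations, dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I, and dR/dt = gamma*I, with a simple forward Euler scheme. It is a generic textbook illustration rather than the model of any cited study, and the parameter values (transmission rate beta, recovery rate gamma, population size) are arbitrary assumptions chosen only to produce a visible epidemic curve.

```python
def simulate_sir(beta, gamma, s0, i0, r0, days, dt=0.1):
    """Integrate the SIR equations with forward Euler; return daily (S, I, R)."""
    n = s0 + i0 + r0  # total population, constant over time
    s, i, r = float(s0), float(i0), float(r0)
    trajectory = [(s, i, r)]
    steps_per_day = round(1 / dt)
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / n * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        trajectory.append((s, i, r))
    return trajectory


# Arbitrary illustrative parameters: basic reproduction number beta / gamma = 3
trajectory = simulate_sir(beta=0.3, gamma=0.1, s0=999, i0=1, r0=0, days=160)

peak_day, peak = max(enumerate(trajectory), key=lambda item: item[1][1])
print("Peak infections on day", peak_day, "with about", round(peak[1]), "infected")
```

Forecasting studies typically estimate beta and gamma from reported case counts and then run such a model forward, often alongside regression or time series models.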

Early detection of COVID-19 and health monitoring of patients are urgent concerns for
health service authorities in order to isolate infected people, provide timely treatment [32], and
prevent the spread of the disease [30]. There were 210 articles identified from the dataset discussing this
topic. The detection of infected patients is very important for several prediction purposes, including
identifying patients who urgently require respiratory support [33], patients with a high risk of mortality
[34], the mortality rate in patients with comorbidities [35], patients’ severity risk [36], and the need for
intensive care unit (ICU) admission for high-risk patients [37]. A specimen test with real-time reverse
transcription polymerase chain reaction (RT-PCR) is currently recognized as the standard test for confirmation
of SARS-CoV-2 infection, delivering the highest accuracy for infection detection compared to other testing
methods [38]. However, this method still has some limitations because the test relies on a sample, has a
relatively high false-negative rate, and is expensive for countries that lack sufficient resources [39].
Previous studies on COVID-19 detected the virus infection based on data analysis of chest X-rays (CXR) and
computerized tomography (CT) images [40], patients’ voice, cough, and breathing patterns [41], [42],
electronic health records (EHR) [37], blood test results [43], and serum samples [38], [44].

AI, machine learning, and the internet of things (IoT) have been widely used for health monitoring
during the SARS-CoV-2 outbreak. Telemedicine and telehealth could be adopted to monitor patients
remotely, especially patients with medical emergencies. A person's health condition can be monitored by
tracking health data such as body temperature, cough rate, respiratory rate, and blood oxygen saturation
through an intelligent smartphone application that applies fog-based machine learning for diagnosis purposes
[45]. Furthermore, the increasing number of COVID-19 patients has significantly increased the demand for
health facilities. Health care providers must manage overload conditions and allocate sufficient equipment
during this emergency situation. Machine learning contributed to predicting the needs of hospital facilities,
such as ambulance availability [46], the required number of beds and mechanical ventilators [33], [47],
hospital hygiene requirements [48], and the optimized redistribution of critical medical supplies [49]. The most
prolific journals for this topic are the Journal of Medical Internet Research, PLoS One, and the International
Journal of Environmental Research and Public Health.

Whether a person is infected with COVID-19 can be detected by analyzing their chest images
[40], [50], [51]. This image-based analysis was used to screen the type and severity of patients’
respiratory disease. The results were integrated with data on other patient symptoms to complete the
diagnosis. Multimodal diagnosis applications were developed by combining medical image data with
breathing sounds and clinical data [52]. Moreover, the severity of COVID-19 patients could be scored by
analyzing their lung ultrasound videos [53]. Overall, more than 10% of the articles in the dataset
(391 articles) relate to image-based COVID-19 diagnosis. Research in this area increased significantly in
the second year of the pandemic because the surge in cases created a need for rapid, highly accurate
diagnosis supported by data processing techniques such as machine learning. Scientific Reports, several
IEEE journals, and Computers in Biology and Medicine were the top journals publishing articles in this field.
Deep learning was the dominant method for extracting complex information from chest images. Transfer
learning was also applied in some studies (29 articles) for the automatic detection of infected patients.
Transfer learning offers a way to develop smart applications rapidly, which matters because the virus
mutates quickly into variants that can be more infectious and spread faster. The technique takes a model
pre-trained in a related area, such as pneumonia research, and adapts it to a new disease, such as
COVID-19 [54].
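
The idea of reusing a pre-trained model can be illustrated with a minimal sketch in plain Python. It is not the method of [54]: the frozen weights in `PRETRAINED_WEIGHTS`, the four-value inputs, and the labels are all hypothetical stand-ins for a real convolutional base and real chest images. The sketch only shows the general pattern, in which the pre-trained layers are kept fixed and a small new classification head is trained on the target task.

```python
import math

# Hypothetical "pre-trained" base. In practice this would be the convolutional
# layers of a network trained on a source task (e.g. pneumonia images).
# These weights are frozen: nothing below ever updates them.
PRETRAINED_WEIGHTS = [
    [0.9, -0.4, 0.1, 0.3],
    [-0.2, 0.8, 0.5, -0.6],
    [0.4, 0.3, -0.7, 0.2],
]

# Toy target-task data (assumed): 1 = positive case, 0 = negative case.
TARGET_SAMPLES = [[1, 0, 0, 0], [1, 0, 0, 1], [0, 1, 0, 0], [0, 1, 1, 0]]
TARGET_LABELS = [1, 1, 0, 0]


def extract_features(x):
    """Frozen base: one linear layer followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)))
            for row in PRETRAINED_WEIGHTS]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(samples, labels, epochs=200, lr=0.5):
    """Train only the new head (logistic regression) on frozen features."""
    weights = [0.0] * len(PRETRAINED_WEIGHTS)
    bias = 0.0
    # Features are computed once because the base never changes.
    feats = [extract_features(x) for x in samples]
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            p = sigmoid(sum(w * v for w, v in zip(weights, f)) + bias)
            err = p - y  # gradient of the log loss with respect to the logit
            weights = [w - lr * err * v for w, v in zip(weights, f)]
            bias -= lr * err
    return weights, bias


def predict(x, weights, bias):
    f = extract_features(x)
    score = sum(w * v for w, v in zip(weights, f)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0


head_weights, head_bias = train_head(TARGET_SAMPLES, TARGET_LABELS)
predictions = [predict(x, head_weights, head_bias) for x in TARGET_SAMPLES]
```

Because only the head is trained, very few parameters need to be fitted, which is one reason the approach suits situations where labelled data for a new disease are still scarce.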

During the COVID-19 pandemic, machine learning has contributed to the discovery of medicines and
vaccines to increase patients’ survival rates. This review identified 71 articles covering the field.
Bioinformatics applications have been developed for discovering, repositioning, and repurposing drugs to
find effective clinical treatments [55], [56]. These studies analyzed the interactions between drug
compounds and SARS-CoV-2 proteins and predicted the resulting bioactivities. The interaction between
viral proteins and drugs was widely analyzed in past studies, which elaborated on the biochemical
features and mutations of SARS-CoV-2 proteins and their implications for drug development [57].
To reduce disease severity, researchers also sought to identify potential inhibitors of the
SARS-CoV-2 main protease [58]. Some artificial intelligence applications were developed to repurpose
available drugs for therapeutic strategies against COVID-19 [59]. In addition to drug development,
vaccines play a very important role in reducing the number of cases and the severity of COVID-19.
Machine learning was applied in many past studies to design vaccines for COVID-19 [60]. Research on
vaccine development also aimed to predict and mitigate the threat of mutations in new variants of
SARS-CoV-2 [61]. Researchers also examined public reaction to the COVID-19 vaccine to observe how the
public responded to the vaccine announcement [62].

Virus characteristics and the human body’s response to infection were the primary concerns of the
viral infection topic. Forty-two articles discussed the topic: 19 published in 2020 and 23 in 2021
(until mid-year). Analytical Chemistry, the Indian Journal of Medical Research, Scientific Reports, and The
Lancet Microbe were the most prolific sources. SARS-CoV-2 has evolved through genetic mutations to adapt
to environmental changes. Knowledge of virus evolution and transmission is essential during the pandemic
to develop appropriate intervention strategies for controlling the spread of the virus [63]. Previous studies
on the evolution of the SARS-CoV-2 genome signature covered: 1) identification of differences and
similarities of viral variants based on genome sequence analysis [64]; 2) identification of protein
interactions [65] and determination of the functions and pathways of proteins in biological processes [66];
3) evaluation of viral mutation based on the protein sequence [67]; and 4) prediction of the infectivity of
SARS-CoV-2 mutations [68].
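
As a rough illustration of item 1), the sketch below compares two nucleotide sequences in two simple ways: by the overlap of their k-mers (Jaccard similarity) and by listing point differences between aligned sequences. This is an assumed, simplified stand-in and not the pipeline used in [64]; the function names and the short sequences are made up for the example and are not real SARS-CoV-2 data.

```python
def kmers(seq, k=3):
    """Return the set of all length-k substrings (k-mers) of a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_similarity(seq_a, seq_b, k=3):
    """Share of k-mers common to both sequences (1.0 means identical sets)."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def point_differences(seq_a, seq_b):
    """List (position, base_a, base_b) for two already aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, x, y) for i, (x, y) in enumerate(zip(seq_a, seq_b)) if x != y]


# Made-up toy sequences: the "variant" differs from the "reference" at one position.
reference = "ACGTACGTTAGC"
variant = "ACGTACCTTAGC"

similarity = jaccard_similarity(reference, variant, k=3)
differences = point_differences(reference, variant)
```

Real genome comparisons work on sequences of roughly 30,000 bases and normally rely on proper alignment tools, so the position-by-position comparison here should be read only as a conceptual example.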

Furthermore, the severity of SARS-CoV-2 infection depends heavily on host (human body) factors,
such as age and immunity level [69]. Diagnostics of the host response to COVID-19 could be
used for viral ribonucleic acid (RNA) profiling [70] and for identifying protein biomarkers and the
pathogenesis of COVID-19 [71]. People who have recovered from COVID-19 or received the COVID-19
vaccine were found to have SARS-CoV-2 antibodies in their blood. Serological diagnosis based on
antibody response shows that antibody levels and immunity increase in people who have recovered from
COVID-19 [44]. However, the antibody response and the duration of immunity in post-COVID patients vary
greatly depending on individual health condition [72].

The main concerns discussed by previous research in topic 6 included adherence to wearing masks,
social distancing, and lockdown policy during the COVID-19 pandemic. The topic
contained 63 articles published in 2020 (31 articles) and 2021 (32 articles until mid-year). PLoS ONE,
JMIR Public Health and Surveillance, and the International Journal of Environmental Research and Public
Health were the most prolific sources on this topic. Twenty-one studies addressed the detection of
face-mask wearing for human safety during the pandemic. Most of them applied deep learning techniques
for image analysis, employing architectures such as YOLO and MobileNet [73], [74]. To obtain real-time
data automatically, the face-mask detection system was integrated with an IoT system [75], [76]. Finally,
topic mining to explore public discourse against wearing masks was performed by [77]. Based on
user-generated content on Twitter, the LDA topic modelling technique was applied to identify public
concerns about this health protocol. The results can support decision-makers in reshaping policy to
reduce public risk and improve public resilience.
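
To make the LDA step more concrete, the following sketch fits a small topic model with collapsed Gibbs sampling in plain Python. It is an illustrative assumption and not the implementation used in [77]: the tokenized "tweets" in `TOY_DOCS` are invented, the hyperparameters `alpha` and `beta` are arbitrary, and a production study would more likely use an established library on thousands of posts.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on already tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]            # tokens per (doc, topic)
    topic_word = [[0] * n_words for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                           # tokens per topic
    assignments = []
    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1
    # Smoothed topic-word distributions.
    phi = [
        [(topic_word[t][v] + beta) / (topic_total[t] + n_words * beta)
         for v in range(n_words)]
        for t in range(n_topics)
    ]
    return vocab, phi, doc_topic


def top_words(vocab, phi, n=3):
    """Return the n most probable words of each topic."""
    return [
        [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -row[i])[:n]]
        for row in phi
    ]


# Invented, already tokenized example posts.
TOY_DOCS = [
    ["mask", "freedom", "mandate", "mask"],
    ["mask", "mandate", "protest", "freedom"],
    ["vaccine", "dose", "clinic", "vaccine"],
    ["vaccine", "clinic", "dose", "appointment"],
]

vocab, phi, doc_topic = lda_gibbs(TOY_DOCS, n_topics=2)
topics = top_words(vocab, phi, n=3)
```

Each row of `phi` is a probability distribution over the vocabulary, and the highest-probability words of a topic are what an analyst would read to label the public concern it represents.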


4.3.2. Miscellaneous diseases 

The miscellaneous diseases cluster contains research on various types of infectious disease
outbreaks over the last two decades. In total, 619 of the 3,447 articles cover topics in this cluster.
The most widely discussed topic in this cluster is internal disease, with 198 articles, followed by
cancer (141 articles), influenza (109 articles), mosquito-borne diseases (96 articles), and HIV and
drug abuse (75 articles).

Internal disease covers various diseases and health problems affecting the internal
organs. Of the 198 articles, the two most discussed areas are cardiovascular disease (heart and
blood vessels) with 70 articles and endocrine disease (disorders of the endocrine system) with 51 articles.
The rest discussed geriatrics (care of the elderly), gastroenterology (digestive system, liver, and gallbladder),
nephrology (kidneys), and obesity, which affects the health of internal organs. Cardiovascular diseases affect
the heart and blood vessels. Heart attacks and strokes were recognized as the main cause (an estimated
85%) of the 17.9 million cardiovascular deaths in 2019 [2]. Machine learning and deep learning have been
widely used to identify, detect, predict, and forecast several types of cardiovascular disease. Previous
studies explored cardiac conditions, including heart disease detection [78], prediction of heart failure [79],
diagnosis of atrial fibrillation [80], and cardiovascular mortality risk [81]. In addition to cardiovascular
disease, endocrine system disorders, especially diabetes mellitus, were also widely discussed. According to
WHO, diabetes is one of the leading causes of death in the world, with 1.6 million deaths every year [2].
Most of the cases occurred in low- and middle-income countries [82]. Research on diabetes using machine
learning and deep learning approaches has been carried out for the detection of diabetes types [83], the
investigation of the association between obesity and diabetes risk [84], and image classification algorithms
for the detection and screening of diabetic retinopathy [85]. PLoS ONE was the journal that published the
most articles on internal disease, followed by the American Journal of Cardiology, Journal of Medical
Internet Research, BMJ Open Diabetes Research and Care, Diabetologia, and Journal of the American College
of Cardiology.

Cancer, the sixth leading cause of death globally, was the second most discussed topic in the
miscellaneous diseases cluster, with 141 articles. This topic included 24 articles on carcinoma, 22 articles
on breast cancer, 20 articles on lung cancer, and 11 articles on prostate cancer; the rest covered brain
cancer, leukemia, osteosarcoma, thyroid cancer, cervical cancer, and others. The top sources on this topic
are Cancers (MDPI), the International Journal of Medical Informatics, Frontiers in Oncology, and the Journal
of the American Medical Informatics Association. Most of the articles on this topic use machine learning and
deep learning approaches for prediction (81 articles) and for detection and diagnosis (16 articles). Among
the prediction approaches, previous research addressed patient survival likelihood [86], cancer prognosis
[87], treatment plans [88], and metastasis (cancer spreading to a different body part) [89]. Other studies
predicted radiation risk in breast cancer [90], mortality risk [91], malignancy (local spreading) [92], and
recurrences [93].

Historical data on global health show that influenza pandemics have hit the world several times,
infecting many people and causing many deaths. Approximately 75% of confirmed cases are caused by
type A influenza viruses, which can also infect animals [94]. Type A influenza viruses are highly
contagious through contact with poultry and humans and often cause epidemics in
tropical countries. Over the past two decades, 2020 was the most prolific publication year on this topic, with
33 articles from a total of 96 articles. Most of the articles discussed machine learning and deep learning
approaches to predict and forecast virus spread [95], the likelihood that a person has been vaccinated [96],
and vaccine effectiveness [97]. The remaining articles discussed the classification of tropism protein
signatures for influenza virus identification [98], the classification of symptoms for diagnosing suspected
cases [99], and early warning detection of infected patients [100].

Mosquito-borne diseases are transmitted to humans through the bite of infected mosquitoes and
may cause epidemics, especially in countries with tropical and subtropical climates [101]. Diseases
carried by mosquitoes are caused by parasites (malaria) or viruses (dengue, Zika, yellow fever, West Nile,
and others). These diseases have caused many deaths in countries around the world. In the last two decades,
dengue cases have increased more than 8-fold, from 505,430 cases and 960 deaths in 2000 to 5.2 million
cases and 4,032 deaths in 2019 [2]. In 2019 there were 229 million new malaria infections worldwide,
94% of them in the African region [2]. The number of articles on this topic began to increase in
2018 (25 articles) and reached 55 articles in 2020. PLoS Neglected Tropical Diseases published the most
articles, followed by Malaria Journal and the Journal of Communications in Computer and Information
Science. This review identified climate, meteorology, geography, and environment as important
predictors of the distribution of mosquitoes as disease transmitters [102]. Thirty-nine articles
in the dataset discussed these predictors, which were used to predict and forecast
the number of cases, disease spread, and the impact of infection [103]. Further, historical case
data, population density, and human mobility were employed to predict the number of cases and areas at risk
[104]. Machine learning and deep learning methods were also used for the classification of infection severity
[105], the diagnosis of malaria parasites through the analysis of blood smears [106], the detection of antibody
responses to vaccines [107], and the study of mosquitoes as disease vectors [108].

HIV and drug abuse form the last topic in the miscellaneous diseases cluster discussed in this
subsection. HIV is a virus that damages the human immune system. The infection can lead to
acquired immune deficiency syndrome (AIDS) if infected people are not treated promptly. Around
36.3 million lives have been lost due to HIV, and in 2020 about 1.5 million people were newly infected
[2]. Within the two-decade review period, research on this topic has appeared since 2009. AIDS (London, UK),
PLoS ONE, and BMJ are the most prolific sources on this topic. A cure for HIV has not
yet been found, making it a serious global public health problem that could lead to epidemics. Injecting drug
use [109] and opioid abuse [110] were the main triggers of the HIV epidemic. Another severe
problem in dealing with the HIV epidemic is drug resistance. The SVM classification algorithm has been
applied to predict drug resistance and anti-HIV-1 peptides [111]. Other machine learning and deep learning
algorithms have been widely used to: 1) predict HIV incidence and infection [112]; 2) predict drug addiction
and overdose tendencies [113]; 3) detect the likelihood of suicide among opioid users [114]; and 4) identify
people or areas at high risk of HIV infection [115].


4.3.3. Public opinion on disease outbreaks

The last cluster consisted of 232 of the 3,447 articles and discussed public concerns related to
disease outbreaks. Research on this topic began to increase in 2011. The topic attracted
researchers because of the varied public responses to health protocols during the pandemic.
User-generated content posted on social media was the prevalent dataset in previous research because
social media is recognized as the fastest, easiest, and most widely used platform to share news, opinions, and
emotions. Big data technology and NLP enabled the processing of massive amounts of social media data to
identify public concerns of interest. Sentiment analysis could reveal public perceptions and emotions
related to the COVID-19 outbreak [116], [117]. Other public opinion topics widely discussed in
this cluster were public mental health [118], vaccine and anti-vaccine opinions [119], [120], HIV prevention
[121], fake news and misinformation detection [122], and hate speech and racial bullying related to the
outbreak [123]. Moreover, social media data were also widely used for disease spread prediction [124], new
case and event detection [125], and illicit opioid and drug abuse detection [126].

This survey showed that machine learning has covered many research areas in tackling
disease outbreaks using various datasets and algorithms. The algorithms employed private or public medical
data to solve medical problems. Table 4 (see Appendix) presents selected machine learning approaches
identified in this study, grouped by data type.


5. LIMITATIONS 

We acknowledge that this work has some limitations. First, our survey
on the contribution of machine learning to tackling disease outbreaks is restricted to publications in
English, although publications in other languages are also part of the body of knowledge in this domain.
Second, the reviewed articles were selected using pre-determined keywords and a
particular publication period, so other relevant articles may have been left out.
Third, even though the Scopus citation database covers more articles
than other databases such as Web of Science or PubMed, relying on Scopus alone
is another limitation. Lastly, the discussion of topic hotspots, which refers to prominent keywords
derived from the LDA process, involves only the articles with the most relevant and significant
contributions from the authors’ point of view.


6. CONCLUSION 

The study of epidemiology is progressing to protect public health and deliver the highest possible
public health care services, enabling machine learning to contribute more to tackling many disease outbreaks.
The emergence of new viruses and virus mutations has accelerated research in this domain since 2013, and
the trend rose sharply after SARS-CoV-2 was detected. Traditional machine learning and deep
learning have been widely used in previous research with various supervised, unsupervised, and
reinforcement learning techniques. Additionally, federated learning, transfer learning, and ensemble learning
were applied in many studies to reduce complexity and achieve higher accuracy. Integrated with IoT
technology, machine learning has encouraged the development of telehealth and telemedicine. This review
reveals that the scientific structure of this domain is dominated by machine learning research on
COVID-19 and on miscellaneous diseases caused by pathogens or genetic factors. Previous studies used a
huge amount of multimodal medical data to predict, forecast, classify, or screen in order to address
many disease-related problems, including epidemiological surveillance, diagnosis, treatment, health
monitoring, epidemic management, viral infection, and pathogenesis. Public opinion on new diseases, along
with public perception of the health protocols and policies intended to prevent the spread of disease, is
also a topic of interest to researchers. Research on epidemiology remains challenging, and bioinformatics
applying machine learning still has many opportunities to provide solutions for health and medicine. Virus
genomics and evolution remain open for study, and pathogen and drug discovery is required to face future
disease threats. Patient management and public health likewise require continuous improvement.
Hence, machine learning should be harnessed in epidemiology for handling disease outbreaks.


ACKNOWLEDGEMENTS 

The authors wish to thank the members of the Information Retrieval Research Group at the Research 
Centre for Data and Information Sciences, National Agency of Research and Innovation (Indonesia) for their 
support throughout this work. 


Bulletin of Electr Eng & Inf, Vol. 11, No. 4, August 2022: 2169-2186 



ISSN: 2302-9285 



APPENDIX
Table 4. Machine learning approaches for disease outbreaks

No. | Study aims | Data | Model/Algorithm

Electronic medical record (EMR) dataset
1. | To predict mechanical ventilation and mortality COVID-19 patients [33] | EMR | XGBoost and CatBoost
2. | To predict physiological damage and death up to the next 20 days [34] | EMR | Ensemble model (logistic regression (LR), SVM, gradient boosting, decision tree (DT), and neural network)
3. | To predict cardiovascular diseases [78] | Cardiac disease dataset | Recursion enhanced random forest (RF)-improved linear model (RFRF-ILM), and internet of medical things (IoMT)
4. | To predict the mortality risk and heart failure in hospitalized patients [80] | Treatment of preserved cardiac function heart failure with an aldosterone antagonist (TOPCAT) data | LR, RF, and gradient descent boosting
5. | To predict the fatality of acute myocardial infarction (AMI) patients [81] | AMI data | Deep learning for AMI (DAMI), LR, and RF
6. | To classify diabetes disease with a new hybrid intelligent system [83] | Pima Indian Diabetes dataset | EM clustering, incremental PCA and incremental SVM
7. | To prognosis treatment decisions by analyzing pathological microscopic features [87] | Sun Yat-Sen University Cancer Center (data on patients with nasopharyngeal carcinoma (NPC)) | Deep feed-forward neural network (DeepSurv)
8. | To predict Hepatocellular carcinoma (HCC) recurrences [93] | Clinical data of patients with HCC | Novel Bayesian network
9. | To detect malaria-infected red blood cells [105], [106] | National Institute of Health (NIH) malaria dataset, Broad Institute malaria dataset | VGG16 and CNN
10. | To identify person at high risk of HIV [115] | HIV testing data from rural Kenya and Uganda | Super learner ensemble

Surveillance dataset
11. | To predict global or country-based COVID-19 cases [26], [28], [29], [31] | Time series COVID-19 dataset | ANN-grey wolf optimizer (GWO), marine predators algorithm (MPA)-ANFIS, convolutional auto encoder (CAE) and AL (autoencoder LSTM)-CNN, SIR model
12. | To forecast the number of beds required as Cov-19 cases [47] | Kaggle data | MLP
13. | To optimize the redistribution of critical medical supplies [49] | Statistics data from Centers for Disease Control and Prevention | LSTM
14. | To predict the survival rate of patients with oral and pharyngeal cancer (OPCs) [66] | The surveillance, epidemiology, and end results (SEER) database | Survival tree (ST), RF, conditional inference forest (CF), and Cox proportional hazard models
15. | To predict lymph node metastases for colorectal cancer patients [89] | SEER database | LR, XGBoost, k-NN, regression trees, SVM, ANN, and RF
16. | To predict whether a person has been vaccinated against H1N1 and seasonal flu [96] | The National H1N1 Flu survey (NHFS) | MLBox, TPOT, polynomial feature, RF, MLP, LR, DT, XGBoost and CatBoost
17. | To predict dengue outbreaks [102] | Dengue surveillance data | SVR, gradient boosting, GAM, negative binomial regression, LASSO, and LR
18. | To describe HIV trends and predict their occurrence [112] | Chinese Center for Disease Control and Prevention Database | LSTM, ARIMA, GRNN, and exponential smoothing

Images dataset
19. | To early detect COVID-19 patient [40] | Chest X-ray (CXR) images | ResNet and CNNet
20. | To analyze and categorize COVID-19 X-ray images [54] | X-Ray images dataset | EfficientNet, MobileNet, Xception, Inception V3, and VGG19
21. | To extract CT masks to improve the diagnosis of COVID-19 [51] | Computed tomography (CT) images | Multi-agent deep reinforcement learning
22. | To diagnose COVID-19 using breath sounds, chest X-ray (CXR) [52] | CXR images data | InceptionV3-MLP
23. | To identify the species of the mosquitoes gender [108] | Data images of anesthetized/dead mosquito | YOLO

Voice dataset
24. | To detect COVID-19 through voice analysis of speech or cough [41], [42] | Coswara and Virufy database | Naive Bayes (NB), BayesNet (BN), SVM, stochastic gradient descent (SGD), KNN, locally weighted learning (LWL), RNN, Adaboost, Bagging algorithms, One-R, decision table, DT, and REPTree

IoT generated/video dataset
25. | To apply IoT in healthcare and physical distance monitoring for pandemic situations [45] | Sensor data | ANFIS, DT, and SVM
26. | To detect face masks for infection spread control [74], [75] | Camera-generated data | VGG-16, MobileNetV2, Inception V3, ResNet-50, and CNN-transfer learning

Table 4. Machine learning approaches for disease outbreaks (continued)

No. | Study aims | Data | Model/Algorithm

27. | To predict the dengue incidence based on human mobility [104] | Mobile network big data | ANN and XGBoost

Protein/genome dataset
28. | To identify the protein-ligand interactions for a specific drug [57] | Genome sequence dataset | Feed-Forward DNN
29. | To predict the anti-viral activities of resultant compounds [58] | Protein data bank | AutoQSAR algorithm
30. | To predict COVID-19 vaccine candidates [60] | NCBI GenBank | Vaxign-ML
31. | To analyze genomic mutations of SARS-CoV-2 [63] | NCBI GenBank | RNN
32. | To predict the protein interactions between SARS-CoV-2 virus and human proteins [65] | Non-structural protein data | BiRNN
33. | To predict the effect of mutations on SARS-CoV-2 infectivity [68] | Protein and CFR (case fatality rate) data | DeepNEU: RNN, cognitive maps (CM), SVM and genetic algorithm (GA)
34. | To improve understanding and treatment of obesity and diabetes [84] | Qatar Biobank | PCA, RF, and gradient boosting
35. | To identify possible zoonotic influenza-A viruses [98] | Protein sequence-Influenza Research Database | RF
36. | To predict anti-HIV-1 peptides [111] | Amino acid sequences data | SVM

37. | To detect dengue and flu outbreaks [73] | Twitter data | RF, KNN, SVM, and DT
38. | To detect suicidality among opioid users [114] | Reddit data | RNN and CNN
39. | To identify HIV-associated patterns from large sets of social data [121] | Large collections of social media data | LR, RF, and ridge regression
40. | To detect misleading information related to COVID-19 [122] | WHO, UNICEF news/report | DT, KNN, LR, SVM, Multinomial NB, Bernoulli NB, Perceptron NN, Ensemble RF, XGBoost
41. | To detect cyberbullying regarding the pandemic [123] | Instagram, Facebook, Twitter, and Youtube | CNN, capsule network (CAPSNET), and deep NN

Fusion dataset
42. | To measure the number of emergency ambulances required during an emergency situation [46] | Integrated data: environmental data, the localization data of mobile phone users, and the historical EAD data | LSTM